This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
HOW TO READ THESE TABS
This tab will demonstrate cleaning a simulated Ebola case linelist.
Text about cleaning data, approaches, etc. renaming replace missing with dealing with cases (all lower, etc) case_when() factors
Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)
You can view the original raw dataset below:
Text here about a cleaning pipeline of dplyr verbs… order is important
Variable names are used so often, it is best that they have “clean” syntax. We suggest the following:
The names of the raw linelist are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).
names(linelist_raw)
## [1] "case_id" "generation" "infection date" "date onset"
## [5] "hosp date" "date_of_outcome" "outcome" "gender"
## [9] "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "fever"
## [17] "chills" "cough" "aches" "vomit"Note: To use a variable names that include spaces, surround the name with back-ticks, for example: linelist$`infection date`
On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).
The function clean_names() from the package janitor is very useful. Here is an online vignette
# send the dataset through the function clean_names()
linelist <- linelist_raw %>%
janitor::clean_names()
# see the new names
names(linelist)
## [1] "case_id" "generation" "infection_date" "date_onset"
## [5] "hosp_date" "date_of_outcome" "outcome" "gender"
## [9] "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "fever"
## [17] "chills" "cough" "aches" "vomit"Re-naming variables manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new variable name is given before the old variable name.
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome)Now you can see that the variables names have been changed:
Often the first step of cleaning data is selecting the variables you want to work with, and their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.
Within select() you can do the following:
# linelist dataset is piped through select() command, and prints just the variable names
linelist %>%
select(case_id, date_onset, date_hospitalisation, fever) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever"select() which variables to remove by placing a minus symbol “-” in front of the variable name (or a vector of variable names). It will keep all others. Inside select() you can use normal operators such as : for consecutive values, c() to list values/variables, ! for opposite, & for AND, and | for OR.linelist %>%
select(-c(fever:vomit)) %>%
names()
## [1] "case_id" "generation" "date_infection"
## [4] "date_onset" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital"
## [10] "lon" "lat" "infector"
## [13] "source" "age" "age_unit"everything() to signify all other variables not specified:linelist %>%
select(case_id, date_onset, date_hospitalisation, fever, everything()) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever" "generation" "date_infection"
## [7] "date_outcome" "outcome" "gender"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "chills" "cough"
## [19] "aches" "vomit"As well as everything() there are several special functions that work within select(), namely:
everything() - all other variables not mentioned
last_col() - the last column
starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
contains() - variables containing a character string. Example: select(contains("time"))
matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))
num_range() -
any_of() - matches if variable is named. Useful if name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))
where() - applies a function to all variables and selects those which are TRUE
select as a standalone command
See section on object classes
Here we want to ensure that the class of each variable is appropriate, so we’ll add it to our cleaning pipe chain.
The class of the “age” variable is character. To perform analysis, we need those numbers to be recognized as numeric!
The class of the “date_onset” variable is also character! To perform analysis, these dates must be recognized as dates!
Use table() or another method to see all the values, can see that we see that one date was entered in a different format (15 April 2014) than all the others!
##
## 15 April 2014 2014-04-07 2014-04-21 2014-04-25 2014-04-26
## 1 1 2 1 1
## 2014-04-27
## 1
This means before we can classify “date_onset” as a date, we must fix this value. We can do this using mutate() and recode() in our cleaning pipe chain, before the commands to convert to class Date.
The new mutate line can be read as: mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE. Note that this pattern (OLD = NEW) is the opposite of most R patterns. The R development community is working on revising this.
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
# fix incorrect values
# old value # new value
mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>%
# correct the class of the variables
mutate(age = as.numeric(age),
date_onset = as.Date(date_onset, format = "%Y-%m-%d"))Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! The date format = entered is often a source of problems.
After selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter()
filter(dataset, criteria) OR subset like: dataset_new <- dataset[criteria,criteria]
We advise creating new variables with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use the base R style to create a new variable.
As explained in the section on dplyr and tidyverse coding style (LINK HERE), a chain of ‘verb’ functions operate on a dataset through ‘pipes’ (%>%), passing the output from one verb to the next. The verb mutate() used to add a new variable or modify an existing one.
Below are some example of creating new variables with mutate(). The syntax is: new_variable_name = value or function. It is best practice to separate each new variable with a comma and new line.
linelist <- linelist %>% # creating new, or modifying old dataset
mutate(new_var_dup = case_id, # new variable = duplicate/copy another variable
new_var_static = 7, # new variable = all values the same
new_var_static = new_var_static + 5, # you can overwrite a variable, and can modify a variable multiple times
new_var_calc = (age / 12), # new variable = a calculation
new_var_paste = paste0(hospital, " (", date_hospitalisation, ")") # new variable = pasting together values from other variables
) Scroll to the right to see the new variables:
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )The verb transmute() adds new variables just like mutate() but also drops/removes all other variables that you do not mention.
linelist <- linelist_raw %>% # creating new, or modifying old dataset
transmute(new_var_dup = id, # new variable = duplicate/copy another variable
new_var_static = 7, # new variable = all values the same
new_var_static = new_var_static + 5 # you can overwrite a variable, and can modify a variable multiple times
new_var_calc = (age / 12) + months # new variable = a calculation
new_var_paste = paste0(district, "(", province, ")") # new variable = pasting together values from other variablesModifying values in a variable is covered in this tab (LINK HERE)
Missing if… na_if() lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),
coalesce()
if_else(), ifelse()
recode
Replace
Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html
Taken from website above:
Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:
starwars %>%
select(name, mass, species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
select(name, mass, species) %>%
group_by(species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.
If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new variable by assigning it a value. In the command below, the variable new_var does not exist until after the command is executed. In this simple example the variable is assigned the static value “new value”, so for all rows the value will be “new value”.
You can also give the new variable a dyanmic value as shown below, or using the case_when() command explained in the next tab.
case_when())TODO tutorial on using case_when()
For example, creating age groups cut()
case_when()
age_categories() (R4Epis package)
by percentile
WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARAIBLES (e.g. numeric age + unit)
Within a group, indicate/convert to the highest value in the group
Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)